A Uniform-grid Discretization Algorithm for 
Stochastic Control with Risk Constraints 

Yin-Lam Chow, Marco Pavone 


Abstract: In this paper, we present a discretization algorithm for the finite-horizon, risk-constrained dynamic programming algorithm in [1]. Although, from a theoretical standpoint, Bellman's recursion provides a systematic way to find optimal value functions and generate optimal history-dependent policies, there is a serious computational issue. Even if the state space and action space of this constrained stochastic optimal control problem are finite, the spaces of risk thresholds and feasible risk updates are closed, bounded subsets of the real numbers. This prohibits any direct application of the unconstrained finite-state iterative methods in dynamic programming found in [2]. In order to approximate the Bellman operator derived in [1], we discretize the continuous action spaces and formulate a finite-space approximation of the exact dynamic programming algorithm. We also prove that the approximation error of the optimal value functions is bounded linearly by the discretization step size. Finally, implementation details and possible modifications are discussed.

I. Introduction 

Constrained stochastic optimal control problems naturally arise in decision-making problems where one has to consider multiple objectives. Instead of introducing an aggregate utility function that has to be optimized, one considers a setup where one cost function is to be minimized while keeping the other cost functions below some given bounds. Application domains are broad and include engineering, finance, and logistics. Within a constrained framework, the most common setup is, arguably, the optimization of a risk-neutral expectation criterion subject to a risk-neutral constraint [3], [1]. This model, however, is not suitable in scenarios where risk aversion is a key feature of the problem setup. To introduce risk aversion, in [1] the authors studied stochastic optimal control problems with risk constraints, where risk is modeled according to dynamic, time-consistent risk metrics [4], [5]. These metrics have the desirable property of ensuring rational consistency of risk preferences across multiple periods [5]. (In contrast, traditional static risk metrics, such as conditional value at risk, can lead to potentially “inconsistent” behaviors, see [6] and references therein.) In particular, in [1], the authors developed a dynamic programming approach that allows one to (formally) compute the optimal costs by value iteration via a constrained dynamic programming operator. The key idea is that, due to the compositional structure of dynamic risk constraints, the optimization problem can be cast as a Markov decision problem (MDP) on an augmented state space where Markov policies are optimal (as opposed to the original problem) and Bellman’s recursion can be applied. Henceforth, we will refer to such an augmented MDP as AMDP. However, even if both the state space and action spaces for the original optimization problem are assumed to be finite, the augmented state in AMDP contains state variables that are continuous and lie in bounded subsets of the real numbers. Hence, apart from a few cases when an analytical solution is available, the problem must be solved numerically.

Accordingly, the objective of this paper is to develop a numerical method for the solution of stochastic optimal control problems with dynamic, time-consistent risk measures. The approach is to discretize the continuous states in AMDP. The design of numerical algorithms for the solution of continuous MDPs is indeed a fairly mature field. In [7], [8], [9], multi-grid state/action space discretization methods are developed with bounds available on how fine the discretization should be in order to achieve a desired accuracy. In [10], the grid for discretization is chosen via randomized sampling techniques and Monte Carlo methods. In [11], the value functions are approximated by a finite number of basis functions. Variable resolution grid sampling techniques have been proposed in [12], [13], [14]. However, in general, these results assume that the dynamic programming operator is unconstrained, i.e., actions and future states are only constrained to lie in their respective feasible sets. In contrast, the dynamic programming operator for AMDP constrains actions and future states in a more complicated fashion (see Section II for more details). This precludes the application of current approximation algorithms to the numerical solution of AMDP.

Our approach is to extend the uniform grid discretization approximation developed in [9]. This requires the development of novel Lipschitz bounds for constrained dynamic programming operators. We show that convergence is linear in the step size, which is the same convergence rate as for discretization algorithms for unconstrained dynamic programming operators [9]. The importance of our result is fourfold. First, we provide a sound numerical method for the solution of AMDP. Second, our results provide the basis to develop more sophisticated approximation algorithms (e.g., variable grid size, reinforcement learning, etc.) for the solution of stochastic optimal control problems with dynamic, time-consistent risk constraints. Third, a particular type of dynamic, time-consistent “risk” constraint is, of course, the risk-neutral expectation. Hence, our results provide as a particular case a numerical algorithm to solve the dynamic programming equations that arise in traditional constrained stochastic optimal control problems [3]. To the best of our knowledge, this is the first practical algorithm to solve such dynamic programming equations. Finally, the ideas and techniques introduced in the current paper could be useful for the development of approximation algorithms for other types of constrained dynamic programming operators.

The rest of the paper is structured as follows. In Section II we present background material for this paper, in particular about dynamic, time-consistent risk metrics and stochastic optimal control with dynamic risk constraints [1]. In Section III we present and theoretically study a uniform grid approximation algorithm for the augmented MDP; in particular, we show that the error bound is linear in the discretization step size. In Section IV we study by numerical simulations the performance of the proposed algorithm and discuss details of implementations using Branch and Bound techniques. Finally, in Section V we draw our conclusions and offer directions for future work.

II. Preliminaries 

In this section we provide some background for the theory of dynamic, time-consistent risk metrics and stochastic optimal control with dynamic risk constraints, on which we will rely extensively later in the paper.

A. Notations 

In this paper, given a real-valued function $f$, $\mathrm{dom}(f)$ denotes its domain and $\mathrm{epi} f$ denotes its epigraph (i.e., the set of points lying on or above its graph). Let $\nu$ and $\mu$ be two probability measures on the same measurable space; then $\nu \ll \mu$ denotes that $\nu$ is absolutely continuous with respect to $\mu$ (i.e., $\nu(E) = 0$ for every set $E$ for which $\mu(E) = 0$).

B. Markov Decision Processes

A finite Markov Decision Process (MDP) is a four-tuple $(S, U, Q, U(\cdot))$, where $S$, the state space, is a finite set; $U$, the control space, is a finite set; for every $x \in S$, $U(x) \subseteq U$ is a nonempty set which represents the set of admissible controls when the system state is $x$; and, finally, $Q(\cdot|x,u)$ (the transition probability) is a conditional probability on $S$ given the set of admissible state-control pairs, i.e., the set of pairs $(x,u)$ where $x \in S$ and $u \in U(x)$.

Define the space $H_k$ of admissible histories up to time $k$ by $H_k = H_{k-1} \times S \times U$, for $k \ge 1$, and $H_0 = S$. A generic element $h_{0,k} \in H_k$ is of the form $h_{0,k} = (x_0, u_0, \ldots, x_{k-1}, u_{k-1}, x_k)$. Let $\Pi$ be the set of all deterministic policies with the property that at each time $k$ the control is a function of $h_{0,k}$. In other words,

$$\Pi := \Big\{ \{\pi_0 : H_0 \to U,\ \pi_1 : H_1 \to U, \ldots\} \ \Big|\ \pi_k(h_{0,k}) \in U(x_k) \text{ for all } h_{0,k} \in H_k,\ k \ge 0 \Big\}.$$

C. Dynamic, time-consistent risk measures

Consider a probability space $(\Omega, \mathcal{F}, P)$, a filtration $\mathcal{F}_1 \subset \mathcal{F}_2 \cdots \subset \mathcal{F}_N \subset \mathcal{F}$, and an adapted sequence of random variables $Z_k$, $k \in \{0, \ldots, N\}$. We assume that $\mathcal{F}_0 = \{\Omega, \emptyset\}$, i.e., $Z_0$ is deterministic. In this paper we interpret the variables $Z_k$ as stage-wise costs. For each $k \in \{1, \ldots, N\}$, define the spaces of random variables with finite $p$th order moment as $\mathcal{Z}_k := L_p(\Omega, \mathcal{F}_k, P)$, $p \in [1, \infty]$; also, let $\mathcal{Z}_{k,N} := \mathcal{Z}_k \times \cdots \times \mathcal{Z}_N$.

Roughly speaking, a dynamic risk measure is said to be time consistent if, when a $Z$ cost sequence is deemed less risky than a $W$ cost sequence from the perspective of a future time $k$, and both sequences yield identical costs from the current time $l$ to the future time $k$, then the $Z$ sequence is deemed less risky at the current time $l$. It turns out that dynamic, time-consistent risk metrics can be constructed by “compounding” one-step conditional risk measures, which are defined as follows.

Definition II.1 (Coherent one-step conditional risk measures). A coherent one-step conditional risk measure is a mapping $\rho_k : \mathcal{Z}_{k+1} \to \mathcal{Z}_k$, $k \in \{0, \ldots, N\}$, with the following four properties:

• Convexity: $\rho_k(\lambda Z + (1-\lambda) W) \le \lambda \rho_k(Z) + (1-\lambda)\rho_k(W)$, $\forall \lambda \in [0,1]$ and $Z, W \in \mathcal{Z}_{k+1}$;

• Monotonicity: if $Z \le W$ then $\rho_k(Z) \le \rho_k(W)$, $\forall Z, W \in \mathcal{Z}_{k+1}$;

• Translation invariance: $\rho_k(Z + W) = Z + \rho_k(W)$, $\forall Z \in \mathcal{Z}_k$ and $W \in \mathcal{Z}_{k+1}$;

• Positive homogeneity: $\rho_k(\lambda Z) = \lambda \rho_k(Z)$, $\forall Z \in \mathcal{Z}_{k+1}$ and $\lambda \ge 0$.

Then, the following results characterize dynamic, time-consistent risk metrics [4].

Theorem II.2 (Dynamic, time-consistent risk measures). Consider, for each $k \in \{0, \ldots, N\}$, the mappings $\rho_{k,N} : \mathcal{Z}_{k,N} \to \mathcal{Z}_k$ defined as

$$\rho_{k,N} = Z_k + \rho_k\big(Z_{k+1} + \rho_{k+1}(Z_{k+2} + \ldots + \rho_{N-2}(Z_{N-1} + \rho_{N-1}(Z_N)) \ldots)\big), \quad (1)$$

where the $\rho_k$’s are coherent one-step risk measures. Then, the ensemble of such mappings is a time-consistent dynamic risk measure.

In this paper we consider a (slight) refinement of the concept of dynamic, time-consistent risk measure, which involves the addition of a Markovian structure [4].

Definition II.3 (Markov dynamic risk measures). Let $\mathcal{V} := L_p(S, \mathcal{B}, P)$ be the space of random variables on $S$ with finite $p$th moment. Given a controlled Markov process $\{x_k\}$, a Markov dynamic risk measure is a dynamic, time-consistent risk measure in which each coherent one-step risk measure $\rho_k : \mathcal{Z}_{k+1} \to \mathcal{Z}_k$ in equation (1) can be written as:

$$\rho_k(V(x_{k+1})) = \sigma_k\big(V(x_{k+1}), x_k, Q(x_{k+1}|x_k,u_k)\big), \quad (2)$$

for all $V(x_{k+1}) \in \mathcal{V}$ and $u_k \in U(x_k)$, where $\sigma_k$ is a coherent one-step risk measure on $\mathcal{V}$ (with the additional technical property that for every $V(x_{k+1}) \in \mathcal{V}$ and $u_k \in U(x_k)$ the function $x_k \mapsto \sigma_k(V(x_{k+1}), x_k, Q(x_{k+1}|x_k,u_k))$ is an element of $\mathcal{V}$).

In other words, in a Markov dynamic risk measure, the evaluation of risk is not allowed to depend on the whole past.
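To make the compounding construction of Theorem II.2 concrete, the following minimal sketch evaluates a nested risk backward in time on a finite Markov chain under a fixed policy. The choice of the mean-upper-semideviation measure as the coherent one-step risk, and the names `mean_semideviation` and `compounded_risk`, are illustrative assumptions and are not taken from the paper.

```python
def mean_semideviation(z, q, lam=0.5):
    """One-step risk E[Z] + lam * E[(Z - E[Z])_+] under the pmf q.

    This measure is coherent for 0 <= lam <= 1; it is used here only as an
    example of a coherent one-step conditional risk measure.
    """
    mean = sum(qi * zi for qi, zi in zip(q, z))
    upper_dev = sum(qi * max(zi - mean, 0.0) for qi, zi in zip(q, z))
    return mean + lam * upper_dev


def compounded_risk(P, d, N, lam=0.5):
    """Backward evaluation of the nested risk of N stage costs, zero terminal cost.

    P[x][y] is the closed-loop transition probability from x to y under a
    fixed policy (an assumption of this sketch), and d[x] is the stage-wise
    constraint cost. Returns one risk value per initial state.
    """
    n = len(P)
    risk = [0.0] * n  # no terminal cost
    for _ in range(N):
        risk = [d[x] + mean_semideviation(risk, P[x], lam) for x in range(n)]
    return risk
```

With `lam=0` the one-step measure reduces to the conditional expectation, so the recursion returns the expected cumulative cost; larger values of `lam` penalize upward deviations at every stage.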


D. Stochastic optimal control with dynamic, time-consistent risk constraints

Consider an MDP and let $c : S \times U \to \mathbb{R}$ and $d : S \times U \to \mathbb{R}$ be functions which denote costs associated with state-action pairs. Given a policy $\pi \in \Pi$, an initial state $x_0 \in S$, and a horizon $N \ge 1$, the cost function is defined as

$$J_N^\pi(x_0) := E\left[\sum_{k=0}^{N-1} c(x_k, u_k)\right],$$

and the risk constraint is defined as

$$R_N^\pi(x_0) := \rho_{0,N}\big(d(x_0, u_0), \ldots, d(x_{N-1}, u_{N-1}), 0\big),$$

where $\rho_{k,N}(\cdot)$, $k \in \{0, \ldots, N-1\}$, is a Markov dynamic risk measure (for simplicity, we do not consider terminal costs, even though their inclusion is straightforward). The problem is then as follows:

Optimization problem OPT: Given an initial state $x_0 \in S$, a time horizon $N \ge 1$, and a risk threshold $r_0 \in \mathbb{R}$, solve

$$\min_{\pi \in \Pi} J_N^\pi(x_0) \quad \text{subject to} \quad R_N^\pi(x_0) \le r_0.$$

If problem OPT is not feasible, we say that its value is $\infty$. In [1] the authors developed a dynamic programming approach to solve this problem. To define the value functions, one needs to define the tail subproblems. For a given $k \in \{0, \ldots, N-1\}$ and a given state $x_k \in S$, we define the sub-histories as $h_{k,j} := (x_k, u_k, \ldots, x_j)$ for $j \in \{k, \ldots, N\}$; also, we define the space of truncated policies as $\Pi_k := \big\{ \{\pi_k, \pi_{k+1}, \ldots\} \,\big|\, \pi_j(h_{k,j}) \in U(x_j) \text{ for } j \ge k \big\}$. For a given stage $k$ and state $x_k$, the cost of the tail process associated with a policy $\pi \in \Pi_k$ is simply $J_N^\pi(x_k) := E\left[\sum_{j=k}^{N-1} c(x_j, u_j)\right]$. The risk associated with the tail process is:

$$R_N^\pi(x_k) := \rho_{k,N}\big(d(x_k, u_k), \ldots, d(x_{N-1}, u_{N-1}), 0\big).$$

The tail subproblems are then defined as

$$\min_{\pi \in \Pi_k} J_N^\pi(x_k) \quad (3)$$

$$\text{subject to} \quad R_N^\pi(x_k) \le r_k(x_k), \quad (4)$$

for a given (undetermined) threshold value $r_k(x_k) \in \mathbb{R}$ (i.e., the tail subproblems are specified up to a threshold value).

For each $k \in \{0, \ldots, N-1\}$ and $x_k \in S$, we define the set of feasible constraint thresholds as

$$\Phi_k(x_k) := [\underline{R}_N(x_k), \overline{R}_{N,k}], \qquad \Phi_N(x_N) := \{0\},$$

where $\underline{R}_N(x_k) := \min_{\pi \in \Pi_k} R_N^\pi(x_k)$, and $\overline{R}_{N,k} = (N-k)\rho_{\max}$. The value functions are then defined as follows:

• If $k < N$ and $r_k \in \Phi_k(x_k)$:

$$V_k(x_k, r_k) = \min_{\pi \in \Pi_k} J_N^\pi(x_k) \quad \text{subject to} \quad R_N^\pi(x_k) \le r_k;$$

• if $k < N$ and $r_k \notin \Phi_k(x_k)$:

$$V_k(x_k, r_k) = \infty;$$

• when $k = N$ and $r_N = 0$:

$$V_N(x_N, r_N) = 0.$$

Let $B(S)$ denote the space of real-valued bounded functions on $S$, and $B(S \times \mathbb{R})$ denote the space of real-valued bounded functions on $S \times \mathbb{R}$. For $k \in \{0, \ldots, N-1\}$, we define the dynamic programming operator $T_k[V_{k+1}] : B(S \times \mathbb{R}) \to B(S \times \mathbb{R})$ according to the equation:

$$T_k[V_{k+1}](x_k, r_k) := \inf_{(u, r') \in F_k(x_k, r_k)} \left\{ c(x_k, u) + \sum_{x_{k+1} \in S} Q(x_{k+1}|x_k, u)\, V_{k+1}(x_{k+1}, r'(x_{k+1})) \right\}, \quad (5)$$

where $F_k$ is the set of control/threshold functions:

$$F_k(x_k, r_k) := \Big\{ (u, r') \ \Big|\ u \in U(x_k),\ r'(x') \in \Phi_{k+1}(x') \text{ for all } x' \in S, \text{ and } d(x_k, u) + \rho_k(r'(x_{k+1})) \le r_k \Big\}.$$

If $F_k(x_k, r_k) = \emptyset$, then $T_k[V_{k+1}](x_k, r_k) = \infty$.

Note that, for a given state and threshold constraint, the set $F_k$ characterizes the set of feasible pairs of actions and subsequent constraint thresholds.

Theorem II.4 (Bellman’s equation with risk constraints). For all $k \in \{0, \ldots, N-1\}$ the optimal cost functions satisfy Bellman’s equation:

$$V_k(x_k, r_k) = T_k[V_{k+1}](x_k, r_k).$$

E. Representation theorems

A key result that will be heavily exploited in this paper is the following representation theorem for coherent one-step conditional risk measures.

Theorem II.5. $\rho_k : \mathcal{Z}_{k+1} \to \mathcal{Z}_k$ is a coherent one-step conditional risk measure if and only if

$$\rho_k(Z(x_{k+1})) = \sup_{\xi \in \mathcal{U}_{k+1}(x_k, Q(x_{k+1}|x_k, u_k))} \sum_{x' \in S} \xi(x') Z(x'), \quad (6)$$

where

$$\mathcal{U}_{k+1}(x_k, Q) = \Big\{ \xi \in \mathcal{M} : \xi \ll Q,\ \sum_{x' \in S} \xi(x') Z(x') \le \rho_k(Z),\ \forall Z \in \mathcal{Z}_{k+1} \Big\}$$

and

$$\mathcal{M} = \Big\{ \xi \in \mathbb{R}^{|S|} \ \Big|\ \sum_{x' \in S} \xi(x') = 1,\ \xi(x') \ge 0,\ \forall x' \in S \Big\}.$$

For $Z \in \mathcal{Z}_k \subseteq L_p(\Omega, \mathcal{F}, P)$, $\mathcal{U}_{k+1}(x_k, Q)$ is a subset of $L_q(\Omega, \mathcal{F}, P)$, where $L_q(\Omega, \mathcal{F}, P)$ is the dual space of $L_p(\Omega, \mathcal{F}, P)$, for $1/p + 1/q = 1$ and $p, q \in [1, \infty]$.

Proof. Refer to Theorem 6.4 in [5] and references therein. □

The result essentially says that any coherent risk measure can be interpreted as an expectation taken with respect to a worst-case measure, which is chosen from a suitable set of test measures [6].

Furthermore, the Moreau-Rockafellar Theorem (Theorem 7.4 in [5]) implies $\mathcal{U}_{k+1}(x_k, Q(x_{k+1}|x_k, u_k)) = \partial \rho_k(0)$ when the transition probability kernel is $Q(x_{k+1}|x_k, u_k)$. The next theorem gives a basic duality result on coherent risk measures.

Theorem II.6. $\rho_k : \mathcal{Z}_{k+1} \to \mathcal{Z}_k$ is a coherent one-step conditional risk measure if and only if there exists a bounded, non-empty, weakly* compact and convex set $\mathcal{U}_{k+1}(x_k, Q(x_{k+1}|x_k, u_k))$ such that equation (6) holds. Furthermore, if $\rho_k$ is a coherent risk measure, then it is continuous and sub-differentiable in $\mathcal{Z}_{k+1}$; also, if $\mathrm{dom}(\rho_k) = \{Z \in \mathcal{Z}_{k+1} : \rho_k(Z) < \infty\}$ has a non-empty interior, then $\rho_k$ is finite valued.

Proof. See Proposition 6.5, Theorem 6.6 and Theorem 6.7 in [5]. □

Since the analysis of this paper is restricted to finite state and action spaces, from this theorem, $\mathcal{U}_{k+1}(x_k, Q(x_{k+1}|x_k, u_k)) = \partial \rho_k(0)$ is a non-empty, convex, bounded and compact set in $\mathbb{R}^{|S|}$. By the extreme value theorem, the supremum in equation (6) is attained.
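As an illustration of the representation in equation (6), the sketch below evaluates one specific coherent measure, conditional value at risk at tail level `alpha`, in two ways: through its risk envelope $\{\xi : 0 \le \xi \le Q/\alpha,\ \sum_{x'} \xi(x') = 1\}$, and through the Rockafellar-Uryasev minimization formula. The choice of CVaR and the function names `cvar_dual` and `cvar_primal` are assumptions made for this example only; on a finite state space the supremum is attained by greedily loading the envelope onto the largest outcomes.

```python
def cvar_dual(z, q, alpha):
    """Supremum of sum(xi * z) over {xi : 0 <= xi <= q/alpha, sum(xi) = 1}.

    The maximizer puts as much mass as the envelope allows on the largest
    outcomes first, so a single sorted pass suffices.
    """
    remaining = 1.0
    value = 0.0
    for zi, qi in sorted(zip(z, q), reverse=True):
        mass = min(qi / alpha, remaining)
        value += mass * zi
        remaining -= mass
        if remaining <= 0.0:
            break
    return value


def cvar_primal(z, q, alpha):
    """Rockafellar-Uryasev formula min_t { t + E[(Z - t)_+] / alpha }.

    For a finite distribution the objective is piecewise linear and convex
    in t, so a minimizer can be found among the support points of Z.
    """
    return min(
        t + sum(qi * max(zi - t, 0.0) for zi, qi in zip(z, q)) / alpha
        for t in z
    )
```

For `alpha = 1` the envelope collapses to the single measure $Q$ and both functions return the expectation, which matches the remark that the risk-neutral expectation is a particular coherent risk measure.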

III. Discretization of the Continuous Risk Thresholds in Constrained Dynamic Programming

In the previous section, we have shown that the constrained stochastic optimal control problem can be solved using value iteration (see Theorem II.4). However, the risk threshold $r_k$ in the value function $V_k(x_k, r_k)$, $k \in \{0, \ldots, N-1\}$, is a continuous state. This results in numerical complexity when value iteration is performed. Therefore, in this section, we consider a numerical approximation algorithm using discretization. First of all, the dynamic programming operator possesses several useful properties:

Lemma III.1. Let $V, \tilde V \in B(S \times \mathbb{R})$ be real-valued bounded functions and let $T_k : B(S \times \mathbb{R}) \to B(S \times \mathbb{R})$ be the dynamic programming operator whose expression is given by equation (5), for any $k \in \{0, \ldots, N-1\}$. Then, the following statements hold:

1) Monotonicity: For any $(x, r) \in S \times \mathbb{R}$, if $V \le \tilde V$, then $T_k[V](x, r) \le T_k[\tilde V](x, r)$.

2) Constant shift: For any real number $L$ and $(x, r) \in S \times \mathbb{R}$, $T_k[V + L](x, r) = T_k[V](x, r) + L$, where $(V + L)(x, r) := V(x, r) + L$, $\forall (x, r) \in S \times \mathbb{R}$.

3) Non-expansivity: For all $V, \tilde V \in B(S \times \mathbb{R})$, $\|T_k[V] - T_k[\tilde V]\|_\infty \le \|V - \tilde V\|_\infty$, where $\|\cdot\|_\infty$ is the infinity norm of a function.
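The third property can be obtained from the first two by a standard argument, sketched here for pairs $(x, r)$ at which the feasible set $F_k(x, r)$ is non-empty (an assumption of this sketch; since $F_k$ does not depend on $V$, both operators take the value $\infty$ at the remaining pairs). The constant $L$ below is introduced only for this derivation.

```latex
\begin{align*}
&\text{Let } L := \|V - \tilde V\|_\infty, \text{ so that }
  \tilde V - L \;\le\; V \;\le\; \tilde V + L. \\
&\text{By monotonicity: } T_k[\tilde V - L](x,r) \;\le\; T_k[V](x,r) \;\le\; T_k[\tilde V + L](x,r). \\
&\text{By constant shift: } T_k[\tilde V](x,r) - L \;\le\; T_k[V](x,r) \;\le\; T_k[\tilde V](x,r) + L. \\
&\text{Hence } \big|T_k[V](x,r) - T_k[\tilde V](x,r)\big| \;\le\; L = \|V - \tilde V\|_\infty .
\end{align*}
```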

Next, we introduce the method for constrained dynamic 
programming with discretized risk thresholds and updates. 

A. Dynamic programming with discretized risk thresholds 
and updates 

For $k \in \{0, \ldots, N-1\}$, we will partition $\Phi_k(x_k)$ into $t + 1$ partitions using $t$ grid points $\{r_k^{(1)}, \ldots, r_k^{(t)}\}$ for every fixed $x_k \in S$. The step size of discretization of the risk thresholds $r_k$ is $\Delta$. For $\tau \in \{0, \ldots, t\}$, define the discretized region $\Phi_k^{(\tau)}(x_k) = [r_k^{(\tau)}, r_k^{(\tau+1)})$, where $r_k^{(0)} = \underline{R}_N(x_k)$ and $r_k^{(t+1)} = \overline{R}_{N,k} + \epsilon$, for arbitrarily small $\epsilon > 0$. We also define $\bar\Phi_k(x_k) = \{r_k^{(0)}, \ldots, r_k^{(t)}\}$ to be the finite set of risk thresholds at step $k$. Let $\tau \in \{0, \ldots, t\}$ be such that $r_k \in \Phi_k^{(\tau)}(x_k)$. Now, define the approximation operator $T^D_{\Delta,k}$ for $x_k \in S$, $r_k \in \Phi_k^{(\tau)}(x_k)$:

$$T^D_{\Delta,k}[V](x_k, r_k) := T^D_{\Delta,k}[V](x_k, r_k^{(\tau)}) \quad (7)$$

where, at a grid point $r_k \in \bar\Phi_k(x_k)$,

$$T^D_{\Delta,k}[V](x_k, r_k) := \min_{(u, r^{D,\prime}) \in F_k^D(x_k, r_k)} \left\{ c(x_k, u) + \sum_{x' \in S} Q(x'|x_k, u)\, V(x', r^{D,\prime}(x')) \right\}, \quad (8)$$

where $F_k^D$ is the set of control/threshold functions:

$$F_k^D(x_k, r_k) := \Big\{ (u, r^{D,\prime}) \ \Big|\ u \in U(x_k),\ r^{D,\prime}(x') \in \bar\Phi_{k+1}(x'),\ \forall x' \in S,\ d(x_k, u) + \rho_k(r^{D,\prime}(x_{k+1})) \le r_k \Big\}.$$

If $F_k^D(x_k, r_k) = \emptyset$, then $T^D_{\Delta,k}[V_{k+1}](x_k, r_k) = \infty$.

By construction, the set of optimal solutions of $T^D_{\Delta,k}[V](x_k, r_k)$ is a subset of the feasible space for the problem described by $T_k[V](x_k, r_k)$ (since $F_k^D(x_k, r_k) \subseteq F_k(x_k, r_k)$ and $r_k^{(\tau)} \le r_k$). Because the solution of $T^D_{\Delta,k}[V](x_k, r_k)$ is an infimum over a finite set, the problem in (8) is a minimization. Also, based on similar proofs, the dynamic programming operator $T^D_{\Delta,k}[V]$ satisfies all the properties given in Lemma III.1. The main result of this section is a bound on the difference between $T_k[V](x_k, r_k)$ and $T^D_{\Delta,k}[V](x_k, r_k)$, which will be given in the next subsection.
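The discretized operator at a grid point can be sketched by brute-force enumeration, as below. This is a minimal illustration under stated assumptions rather than the paper's implementation: the risk measure defaults to the risk-neutral expectation, the data layout (`actions`, `c`, `d`, `Q`, `grids_next`, `V_next`) and the names `uniform_grid`, `expectation` and `discretized_backup` are hypothetical, and the enumeration over all threshold assignments grows exponentially with $|S|$.

```python
import itertools
import math


def uniform_grid(lo, hi, step):
    """Grid points lo, lo + step, ... that do not exceed hi."""
    count = int(math.floor((hi - lo) / step + 1e-12)) + 1
    return [lo + i * step for i in range(count)]


def expectation(values, pmf):
    """Risk-neutral one-step measure, used only as a simple coherent example."""
    return sum(p * v for p, v in zip(pmf, values))


def discretized_backup(x, r, actions, c, d, Q, grids_next, V_next, rho=expectation):
    """One application of the discretized operator at state x, grid threshold r.

    grids_next[y] lists the admissible thresholds at next state y and
    V_next[y][i] is the next-stage value at grids_next[y][i]. Q[x][u] is the
    pmf over next states. Returns math.inf if no pair (u, r') is feasible.
    """
    best = math.inf
    index_sets = [range(len(g)) for g in grids_next]
    for u in actions[x]:
        pmf = Q[x][u]
        for idx in itertools.product(*index_sets):
            r_next = [grids_next[y][i] for y, i in enumerate(idx)]
            if d[x][u] + rho(r_next, pmf) > r + 1e-12:
                continue  # the risk constraint is violated
            value = c[x][u] + sum(
                p * V_next[y][i] for y, (p, i) in enumerate(zip(pmf, idx))
            )
            best = min(best, value)
    return best
```

Applying `discretized_backup` at every grid point of every state, backward from the terminal stage where the only threshold is 0 and the value is 0, yields the discretized value functions; equation (7) then extends them to off-grid thresholds by evaluating at the lower end of the enclosing cell.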

B. Error bound analysis 

First, we make the following assumptions for the analysis:

Assumptions for discretization analysis:

1) There exist $M_c, M_d > 0$ such that

$$|c(x, u) - c(x, \tilde u)| \le M_c |u - \tilde u|,$$

$$|d(x, u) - d(x, \tilde u)| \le M_d |u - \tilde u|,$$

for any $x \in S$, $u, \tilde u \in U(x)$.

2) For any $u, \tilde u \in U(x_k)$, there exists $M_q > 0$ such that

$$\sum_{x' \in S} |Q(x'|x_k, u) - Q(x'|x_k, \tilde u)| \le M_q |u - \tilde u|.$$

Assumptions 1) to 2) are the critical assumptions required to perform the error bound analysis in this section. First, we have the following Proposition showing the Lipschitz property of the set-valued mapping $\mathcal{U}_{k+1}(x_k, Q)$.

Proposition III.2. For any $\xi \in \mathcal{U}_{k+1}(x_k, Q)$, there exists $M_\xi > 0$ such that for some $\tilde\xi \in \mathcal{U}_{k+1}(x_k, \tilde Q)$,

$$\sum_{x' \in S} |\xi(x') - \tilde\xi(x')| \le M_\xi \sum_{x' \in S} |Q(x') - \tilde Q(x')|.$$

Proof. From Theorem II.6 we know that $\mathcal{U}_{k+1}(x_k, Q)$ is a closed, bounded, convex set of probability mass functions. Since any conditional probability mass function $Q$ is in the interior of $\mathrm{dom}(\mathcal{U}_{k+1})$ and the graph of $\mathcal{U}_{k+1}(x_k, Q)$ is closed, by Theorem 2.7 in [15], $\mathcal{U}_{k+1}(x_k, Q)$ is a Lipschitz set-valued mapping with respect to the Hausdorff distance. Thus, for any $\xi \in \mathcal{U}_{k+1}(x_k, Q)$, the following expression holds for some $M_\xi > 0$:

$$\inf_{\tilde\xi \in \mathcal{U}_{k+1}(x_k, \tilde Q)} \sum_{x' \in S} |\xi(x') - \tilde\xi(x')| \le M_\xi \sum_{x' \in S} |Q(x') - \tilde Q(x')|.$$

Next, we want to show that the infimum on the left side is attained. Since the objective function is convex, and $\mathcal{U}_{k+1}(x_k, \tilde Q)$ is a convex compact set, there exists $\tilde\xi \in \mathcal{U}_{k+1}(x_k, \tilde Q)$ such that the infimum is attained. □

Next, we provide a Lemma that characterizes an upper 
bound for the magnitude of the value functions. 







Lemma III.3. For $k \in \{0, \ldots, N-1\}$, the following bound holds for the value function $V_k(x_k, r_k)$: $\|V_k\|_\infty \le (N-k)\, c_{\max}$, where

$$c_{\max} = \max_{(x,u) \in S \times U} |c(x,u)|. \quad (9)$$

Proof. First, from the definition of $V_N(x_N, r_N)$, we know that $V_N(x_N, r_N) = 0$ for any $x_N \in S$, $r_N \in \Phi_N(x_N)$. Therefore, the above inequality holds for $k = N$. For $j \in \{0, \ldots, N-1\}$, since $|c(x_j, u_j)| \le c_{\max}$ for any $x_j \in S$, $u_j \in U(x_j)$, it follows that $\|T_j[V_N]\|_\infty \le c_{\max}$. Furthermore,

$$\begin{aligned}
\|V_j\|_\infty &= \|V_j - V_N\|_\infty \\
&\le \|T_j[V_{j+1}] - T_j[V_N]\|_\infty + \|T_j[V_N] - V_N\|_\infty \\
&\le \|V_{j+1} - V_N\|_\infty + c_{\max} = \|V_{j+1}\|_\infty + c_{\max}.
\end{aligned}$$

The first inequality is due to the triangle inequality and Theorem II.4, the second inequality is due to the non-expansivity property in Lemma III.1, and both equalities in the above expression are due to $V_N(x, r) = 0$. Thus by recursion, we get

$$\|V_k\|_\infty = \sum_{j=k}^{N-1} \big(\|V_j\|_\infty - \|V_{j+1}\|_\infty\big),$$

and the proof is completed by noting that $\|V_j\|_\infty - \|V_{j+1}\|_\infty \le c_{\max}$ for $j \in \{k, \ldots, N-1\}$. □

To prove the main result, we need the following technical Lemma.

Lemma III.4. For every given $x_k \in S$ and $r_k, \tilde r_k \in \Phi_k(x_k)$, suppose Assumptions 1) to 2) hold. Also, define $r' := \{r'(x')\}_{x' \in S} \in \mathbb{R}^{|S|}$ and $\tilde r' := \{\tilde r'(x')\}_{x' \in S} \in \mathbb{R}^{|S|}$. If $F_k(x_k, r_k)$ and $F_k(x_k, \tilde r_k)$ are non-empty sets, then for any $(u, r') \in F_k(x_k, r_k)$, there exists $(\tilde u, \tilde r') \in F_k(x_k, \tilde r_k)$ such that for some $M_{r,k} > 0$,

$$|u - \tilde u| + \sum_{x' \in S} |r'(x') - \tilde r'(x')| \le M_{r,k} |r_k - \tilde r_k|. \quad (10)$$

Proof. First, we want to show that $a(u, r') := d(x_k, u) + \rho_k(r'(x_{k+1}))$ is a Lipschitz function. Define

$$\xi^* \in \arg\max_{\xi \in \mathcal{U}_{k+1}(x_k, Q(x_{k+1}|x_k, u))} \Big\{ d(x_k, u) + \sum_{x' \in S} \xi(x') r'(x') \Big\}.$$

Then, there exists a $\tilde\xi \in \mathcal{U}_{k+1}(x_k, Q(x_{k+1}|x_k, \tilde u))$ such that the following expressions hold:

$$\begin{aligned}
a(u, r') - a(\tilde u, \tilde r')
&= d(x_k, u) + \rho_k(r'(x_{k+1})) - d(x_k, \tilde u) - \rho_k(\tilde r'(x_{k+1})) \\
&\le d(x_k, u) - d(x_k, \tilde u) + \sum_{x' \in S} (\xi^*(x') - \tilde\xi(x'))\, r'(x') + \sum_{x' \in S} \tilde\xi(x') (r'(x') - \tilde r'(x')) \\
&\le |d(x_k, u) - d(x_k, \tilde u)| + \sum_{x' \in S} |r'(x') - \tilde r'(x')| + \max_{x \in S} |r'(x)| \sum_{x' \in S} |\xi^*(x') - \tilde\xi(x')|. \quad (11)
\end{aligned}$$

The first equality follows from the definition of $a$. The first inequality is due to Theorem II.5 and the definition of $\xi^* \in \mathcal{U}_{k+1}(x_k, Q(x_{k+1}|x_k, u))$. The second inequality is due to the fact that $\tilde\xi$ is a probability mass function in $\mathcal{U}_{k+1}(x_k, Q(x_{k+1}|x_k, \tilde u))$. Then, by Proposition III.2 there exists $M_\xi > 0$ such that

$$\sum_{x' \in S} |\xi^*(x') - \tilde\xi(x')| \le M_\xi \sum_{x' \in S} |Q(x'|x_k, u) - Q(x'|x_k, \tilde u)|.$$

Furthermore, by Assumptions 1) to 2) and the definition of $\Phi_{k+1}(x_{k+1})$, expression (11) implies

$$a(u, r') - a(\tilde u, \tilde r') \le M_{A,k} \Big( |u - \tilde u| + \sum_{x' \in S} |r'(x') - \tilde r'(x')| \Big)$$

where

$$M_{A,k} = \max\Big\{ M_d + M_\xi M_q \max_{x' \in S} \max_{r \in \Phi_{k+1}(x')} |r|,\ 1 \Big\}.$$

By a symmetric argument, we can also show that

$$a(\tilde u, \tilde r') - a(u, r') \le M_{A,k} \Big( |u - \tilde u| + \sum_{x' \in S} |\tilde r'(x') - r'(x')| \Big).$$

Thus, by combining both arguments, we have shown that $a(u, r')$ is a Lipschitz function. Next, for any $(u, r') \in F_k(x_k, r_k)$, where

$$F_k(x_k, r) = \Big\{ (u, r') \ \Big|\ u \in U(x_k),\ r'(x') \in \Phi_{k+1}(x'),\ \forall x' \in S,\ a(u, r') \le r \Big\},$$

consider the following optimization problem:

$$P_{x_k, u, r'}(\tilde r) = \inf_{(\tilde u, \tilde r') \in F_k(x_k, \tilde r)} |u - \tilde u| + \sum_{x' \in S} |\tilde r'(x') - r'(x')|.$$

Since $(u, r')$ is a feasible point of $F_k(x_k, r_k)$, $P_{x_k, u, r'}(r_k) = 0$. By our assumptions, both $U(x_k)$ and $\Phi_{k+1}(x_{k+1})$ are compact sets of real numbers. Note that both $|u - \tilde u| + \sum_{x' \in S} |\tilde r'(x') - r'(x')|$ and $a(\tilde u, \tilde r')$ are Lipschitz functions in $(\tilde u, \tilde r')$. Also, consider the sub-gradient¹ of $f(\tilde u, \tilde r', \tilde r) := a(\tilde u, \tilde r') - \tilde r$:

$$\partial f(\tilde u, \tilde r', \tilde r) = \bigcap_{(\hat u, \hat r', \hat r) \in \mathrm{dom}(f)} \left\{ \begin{bmatrix} g_1 \\ g_2 \\ g_3 \end{bmatrix} \in \mathbb{R} \times \mathbb{R}^{|S|} \times \mathbb{R} \ :\ f(\hat u, \hat r', \hat r) \ge f(\tilde u, \tilde r', \tilde r) + \begin{bmatrix} g_1 \\ g_2 \\ g_3 \end{bmatrix}^T \begin{bmatrix} \hat u - \tilde u \\ \hat r' - \tilde r' \\ \hat r - \tilde r \end{bmatrix} \right\}.$$

¹ A sub-gradient of a function $f : X \to \mathbb{R}$ at a point $x_0 \in X$ is a real vector $g$ such that $f(x) - f(x_0) \ge g^T (x - x_0)$, $\forall x \in X$.

Since $a(\tilde u, \tilde r') - \tilde r$ is differentiable in $\tilde r$, the third element of $\partial f(\tilde u, \tilde r', \tilde r)$ is a singleton and equals $\{-1\}$. Next, consider the sub-gradient of $h(\tilde u, \tilde r', \tilde r) = |u - \tilde u| + \sum_{x' \in S} |\tilde r'(x') - r'(x')|$. By identical arguments, we can show that the third element of $\partial h(\tilde u, \tilde r', \tilde r)$ is a singleton and equals $\{0\}$. Therefore, Theorem 4.2 in [16] implies² that $P_{x_k, u, r'}(\tilde r)$ is strictly differentiable (Lipschitz continuous) in $\tilde r$. Then, for any $(u, r') \in F_k(x_k, r_k)$, there exists $M_{r,k} > 0$ such that


$$\inf_{(\tilde u, \tilde r') \in F_k(x_k, \tilde r_k)} |u - \tilde u| + \sum_{x' \in S} |r'(x') - \tilde r'(x')| \le M_{r,k} |r_k - \tilde r_k|.$$

Finally, we want to show that the infimum on the left side of the above expression is attained. First, $|u - \tilde u| + \sum_{x' \in S} |r'(x') - \tilde r'(x')|$ is coercive and continuous in $(\tilde u, \tilde r')$. By Example 14.29 in [17], this function is a Carathéodory integrand and is also a normal integrand. Furthermore, since $F_k(x_k, \tilde r_k)$ is a closed set (since $U(x_k)$ is a finite set, $\Phi_{k+1}(x_{k+1})$ is a compact set and the constraint inequality is non-strict) and $a(\tilde u, \tilde r') - \tilde r_k$ is a normal integrand (see the proof of Theorem IV.2 in [1]), by Theorem 14.36 and Example 14.32 in [17], one can show that the following indicator function:

$$I_{x_k}(\tilde u, \tilde r', \tilde r_k) := \begin{cases} 0 & \text{if } (\tilde u, \tilde r') \in F_k(x_k, \tilde r_k) \\ \infty & \text{otherwise} \end{cases}$$

is a normal integrand. Furthermore, by Proposition 14.44 in [17], the function

$$g_{x_k}(\tilde u, \tilde r', \tilde r_k) := |u - \tilde u| + \sum_{x' \in S} |r'(x') - \tilde r'(x')| + I_{x_k}(\tilde u, \tilde r', \tilde r_k)$$

is a normal integrand. Also, $\inf_{(\tilde u, \tilde r')} g_{x_k}(\tilde u, \tilde r', \tilde r_k) = \inf_{(\tilde u, \tilde r') \in F_k(x_k, \tilde r_k)} |u - \tilde u| + \sum_{x' \in S} |r'(x') - \tilde r'(x')|$. By Theorem 14.37 in [17], there exists a minimizer $(\tilde u, \tilde r') \in \arg\min g_{x_k}(\cdot, \cdot, \tilde r_k)$. Furthermore, the right side of the above equality is finite since $F_k(x_k, \tilde r_k)$ is a non-empty set. The definition of $I_{x_k}(\tilde u, \tilde r', \tilde r_k)$ implies that $(\tilde u, \tilde r') \in F_k(x_k, \tilde r_k)$. Therefore, expression (10) holds for any $(u, r') \in F_k(x_k, r_k)$. □

² Theorem 4.2 in [16] implies both $\partial P_{x_k, u, r'}(r),\ \partial^\infty P_{x_k, u, r'}(r) \subseteq \{0\}$ for $r_k \in \Phi_k(x_k)$. This result further implies that $P_{x_k, u, r'}(r)$ is strictly differentiable. For details, please refer to that paper.

The following Lemma provides a sensitivity condition for the value function $V_k(x_k, r_k)$.

Lemma III.5. Suppose $F_k(x_k, r_k)$ and $F_k(x_k, \tilde r_k)$ are non-empty sets for $k \in \{0, \ldots, N-1\}$. Then, for $x_k \in S$ and $r_k, \tilde r_k \in \Phi_k(x_k)$ such that $r_k \ge \tilde r_k$, $k \in \{0, \ldots, N\}$, the following expression holds:

$$0 \le V_k(x_k, \tilde r_k) - V_k(x_k, r_k) \le M_{V,k} (r_k - \tilde r_k) \quad (12)$$

where $M_{V,k} = \big(M_c + M_q (N-k-1) c_{\max} + M_{V,k+1}\big) M_{r,k} > 0$, and $M_{V,N} = 0$.

Proof. First, for $k \in \{0, \ldots, N-1\}$, when $\tilde r_k \le r_k$, by Lemma IV.1 in [1], we know that $V_k(x_k, \tilde r_k) \ge V_k(x_k, r_k)$. The proof is completed if we can show that for $\tilde r_k \le r_k$,

$$V_k(x_k, \tilde r_k) - V_k(x_k, r_k) \le M_{V,k} (r_k - \tilde r_k).$$

First, at $k = N$, for any $r_N, \tilde r_N \in \Phi_N(x_N)$, we get $V_N(x_N, r_N) = V_N(x_N, \tilde r_N) = 0$. Inequality (12) trivially holds for any $M_{V,N} \ge 0$. As the induction hypothesis, suppose there exists $M_{V,j+1} > 0$ such that the following inequality holds at $k = j + 1$:

$$|V_{j+1}(x, \tilde r_{j+1}) - V_{j+1}(x, r_{j+1})| \le M_{V,j+1} |r_{j+1} - \tilde r_{j+1}|$$

for any $x \in S$. Then, for the case $k = j$, by Theorem IV.2 in [1], the infimum of $T_j[V_{j+1}]$ is attained. From Theorem II.4, $V_j(x_j, r_j) = T_j[V_{j+1}](x_j, r_j)$. For any given $x_j \in S$, $r_j \in \Phi_j(x_j)$, let $(u^*, r^{*,\prime})$ be the minimizer of $T_j[V_{j+1}](x_j, r_j)$. Then, there exists $(\tilde u_j, \tilde r') \in F_j(x_j, \tilde r_j)$ such that inequality (10) and the following expressions hold:

$$\begin{aligned}
V_j(x_j, \tilde r_j) - V_j(x_j, r_j)
&\le c(x_j, \tilde u_j) - c(x_j, u^*) + \sum_{x' \in S} Q(x'|x_j, \tilde u_j) V_{j+1}(x', \tilde r'(x')) - \sum_{x' \in S} Q(x'|x_j, u^*) V_{j+1}(x', r^{*,\prime}(x')) \\
&= c(x_j, \tilde u_j) - c(x_j, u^*) + \sum_{x' \in S} Q(x'|x_j, \tilde u_j) \big( V_{j+1}(x', \tilde r'(x')) - V_{j+1}(x', r^{*,\prime}(x')) \big) \\
&\quad + \sum_{x' \in S} \big( Q(x'|x_j, \tilde u_j) - Q(x'|x_j, u^*) \big) V_{j+1}(x', r^{*,\prime}(x')) \\
&\le \|V_{j+1}\|_\infty \sum_{x' \in S} |Q(x'|x_j, \tilde u_j) - Q(x'|x_j, u^*)| + \max_{x' \in S} \big| V_{j+1}(x', r^{*,\prime}(x')) - V_{j+1}(x', \tilde r'(x')) \big| \\
&\quad + |c(x_j, \tilde u_j) - c(x_j, u^*)|.
\end{aligned}$$

The first inequality follows from the definitions. The second inequality follows from $\sum_{x' \in S} Q(x'|x_j, \tilde u_j) = 1$ and the definitions of $\|V_{j+1}\|_\infty$ and $c_{\max}$. From Assumptions 1) to 2) and the induction hypothesis, the above expression further implies

$$\begin{aligned}
V_j(x_j, \tilde r_j) - V_j(x_j, r_j)
&\le (M_c + M_q \|V_{j+1}\|_\infty) |\tilde u_j - u^*| + M_{V,j+1} \sum_{x' \in S} |\tilde r'(x') - r^{*,\prime}(x')| \\
&\le (M_c + M_q \|V_{j+1}\|_\infty + M_{V,j+1}) M_{r,j} |r_j - \tilde r_j|. \quad (13)
\end{aligned}$$

The last inequality follows from Lemma III.4. In addition, from Lemma III.3 we get

$$\|V_{j+1}\|_\infty = \sum_{i=j+1}^{N-1} \big( \|V_i\|_\infty - \|V_{i+1}\|_\infty \big) \le (N-j-1)\, c_{\max}.$$

Then, by applying this inequality to the expression derived in the previous part of the proof, we get

$$V_j(x_j, \tilde r_j) - V_j(x_j, r_j) \le \big(M_c + M_q (N-j-1) c_{\max} + M_{V,j+1}\big) M_{r,j} |r_j - \tilde r_j|. \quad (14)$$

Thus by induction, expression (12) holds. □
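The Lipschitz constants in Lemma III.5 are defined by a backward recursion starting from $M_{V,N} = 0$, which can be tabulated directly. The sketch below assumes the constants $M_c$, $M_q$, $c_{\max}$ and the sequence $M_{r,k}$ are already known; the function name and argument layout are hypothetical.

```python
def value_lipschitz_constants(N, M_c, M_q, c_max, M_r):
    """Return [M_{V,0}, ..., M_{V,N}] from the recursion of Lemma III.5.

    M_r[k] plays the role of M_{r,k} for k = 0, ..., N-1.
    """
    M_V = [0.0] * (N + 1)  # terminal condition M_{V,N} = 0
    for k in range(N - 1, -1, -1):
        M_V[k] = (M_c + M_q * (N - k - 1) * c_max + M_V[k + 1]) * M_r[k]
    return M_V
```

For instance, with `N = 2`, `M_c = 1`, `M_q = 2`, `c_max = 3` and `M_r = [1, 1]`, the recursion gives $M_{V,1} = 1$ and $M_{V,0} = 1 + 2 \cdot 3 + 1 = 8$.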

The next Lemma shows that the difference between 
dynamic programming operators k [V k+ i](x k , r k ) and 
T k [V k+1 ](x k ,x k ) is bounded. 

Lemma III.6. For any x k G S, x k G & k (x k ), the following 
inequality holds for k G {0,..., N — 1}.' 

0 < T^ k \V k+ i](x k ,r k ) - T k [V k+1 ](x k ,r k ) < M V:k+1 A 

where My, k +i > 0 is given by Lemma \III.5\ and A is the 
step size of the discretization of risk threshold x k . 

Proof First, by the definition of Fjf {x k , x k ), we know 
that F k (x k ,x k ) C F k (x k ,x k ). Since, the objective func¬ 
tions and all other constraints in T® k [V k +i](x k , x k ) and 








Tfc[Vo l fc+ 1 ](a:*;, r k ) are identical, we can easily conclude that 
TE tk [Vk+i](x k ,r k ) > T k [Vk+i](xk,r k ) for all x k £ S, 
r k £ The proof is completed if we can show 

TR,k\Yk+i]{x k ,r k ) - T k [V k+1 ](x k ,r k ) < M Vtk+1 A. 

By Theorem IV. 2 in [1] we know that the infimum of 

Tk[Vk+i](x k ,r k ) is attained. Let (u* k ,r*’') £ F k (x k ,r k ) be 
the minimizer of T k [V k +i](x k , r k ). Also, for every fixed x' £ 

S, let r( x') £ {0, ...,<} such that r*''{x') £ 

Now, construct 

r\x') :=r<$'»£<YV)- 

By definition of <bfc + i(x'), we know that r'(x') £ 34+1 (x 7 ), 
Vx' £ 5 1 . Since ^ is the lower bound of ^(x'), 
we have ^ < r*’'{x'). Furthermore, since the size of 

' > \ x ') i s A, we know that |rj^“ ^ — r*’'(x')| < A for 
any x' £ 5. By monotonicity of coherent risk measures, 

d(x kl u* k ) + pk{f'(x k + 1 )) < d(x k ,u* k ) + p k (r*’'(x k+ 1 )) < r k . 

Therefore, we conclude that ( u k ,r') £ F k (x k ,r k ) is a 
feasible solution to the problem in k [V k +i](xk, r k ). From 
this fact, we get the following inequalities: 

TA, k [ v k+i\( x k, r k )-T k [Vfc+i] (x fc , r k ) 

< E Q(x'\xk,u k )fv k +i(x , ,r , (x')) — V k+1 (x',r*’'(x'))\ 

x’es ' ' 

< sup ( |14 + i(x',f'(x')) - V k+ i(x', r*’ , (x , ))| 1 

i'es l J 

<M V)k+ 1 sup |r'(x') - r*’'(x'))l < My k+1 A. 

x'ES 

The first inequality is due to substituting the feasible
solution of $T_{\Delta,k}[V_{k+1}](x_k, r_k)$ and the optimal solution of
$T_k[V_{k+1}](x_k, r_k)$. The second inequality holds because
$Q(\cdot|x_k, u_k^*)$ is a probability mass function, so the weighted sum
is bounded by the supremum of the absolute differences. The third
inequality is a result of Lemma III.5 and the fourth inequality
is due to the definition of $\tilde{r}'(x')$, for all $x' \in S$. This
completes the proof. □
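The rounding step used in this proof can be illustrated numerically. The sketch below assumes a uniform grid whose region lower bounds are `lo + tau * step`; the function name `project_down`, its parameters, and the sample risk updates are hypothetical and not taken from the paper. It maps each risk update to the lower bound of its region and exposes the two facts the proof relies on: the rounded value never exceeds the original, and the gap is at most the step size.

```python
import math

def project_down(r, lo, step, n_regions):
    """Return (lower bound of the region containing r, region index tau).

    Assumes regions [lo + tau*step, lo + (tau+1)*step) for
    tau = 0..n_regions-1, with the right end point of the interval
    assigned to the last region.
    """
    tau = int(math.floor((r - lo) / step))
    tau = max(0, min(tau, n_regions - 1))  # clamp so the right end point is covered
    return lo + tau * step, tau

# Hypothetical optimal risk updates r*'(x') for three successor states.
r_star = {1: 1.1, 2: 1.6, 3: 2.0}
lo, step, n_regions = 1.0, 0.25, 4

# Rounded-down updates and their distance to the original values.
r_tilde = {x: project_down(r, lo, step, n_regions)[0] for x, r in r_star.items()}
gaps = {x: r_star[x] - r_tilde[x] for x in r_star}
```

Because the rounded update is pointwise no larger than the original one, any monotone risk measure applied to it cannot increase, which is why feasibility of the constraint is preserved.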

The following Lemma is the main result of this section. It
characterizes the error bound between the dynamic programming
operator $T_k[V_{k+1}](x_k, r_k)$ and its discretized counterpart
$T^D_{\Delta,k}[V_{k+1}](x_k, r_k)$.

Lemma III.7. Suppose the Assumptions stated earlier hold. Then,
there exists a constant $M_{V,k} > 0$ such that

$$\| T^D_{\Delta,k}[V_{k+1}] - T_k[V_{k+1}] \|_\infty \le (M_{V,k} + M_{V,k+1})\Delta \qquad (15)$$

where $T^D_{\Delta,k}[V_{k+1}](x, r)$ is the discretized operator defined
earlier in this section, $\Delta$ is the step size of the discretization
of the risk threshold $r_k$, and the expression of
$M_{V,k}, M_{V,k+1} > 0$ is given in Lemma III.5
for $k \in \{0, \dots, N-1\}$.

Proof. For any given $x_k \in S$ and $r_k \in \Phi_k(x_k)$, let
$\tau \in \{0, \dots, t\}$ be such that $r_k \in \Phi_k^{(\tau)}(x_k)$.
Then, by the definition of $T^D_{\Delta,k}[V_{k+1}](x_k, r_k)$ and
Theorem III.4, the following expression holds:

$$\begin{aligned}
\left| T^D_{\Delta,k}[V_{k+1}](x_k, r_k) - T_k[V_{k+1}](x_k, r_k) \right|
\le{} & \left| V_k(x_k, r_k^{(\tau)}) - V_k(x_k, r_k) \right| \\
& + \left| T_{\Delta,k}[V_{k+1}](x_k, r_k^{(\tau)}) - T_k[V_{k+1}](x_k, r_k^{(\tau)}) \right|.
\end{aligned}$$

Also, by using Lemmas III.5 and III.6, the above expression
implies that

$$\left| T^D_{\Delta,k}[V_{k+1}](x_k, r_k) - T_k[V_{k+1}](x_k, r_k) \right|
\le M_{V,k+1}\Delta + M_{V,k} \left| r_k - r_k^{(\tau)} \right|
\le (M_{V,k} + M_{V,k+1})\Delta.$$

The last inequality follows from the fact that
$r_k \in \Phi_k^{(\tau)}(x_k)$ implies $|r_k^{(\tau)} - r_k| \le \Delta$,
where $r_k^{(\tau)}$ is the lower bound of the discretized region of
risk threshold $\Phi_k^{(\tau)}(x_k)$. By taking the supremum over
$x_k \in S$ and $r_k \in \Phi_k(x_k)$ on both sides of the resultant
inequality, we conclude the inequality given in expression (15). □

Next, define $M_r = \max_{k \in \{0, \dots, N-1\}} M_{r,k}$. The following
Theorem provides an error bound between the value function
$V_k(x_k, r_k)$ and the value function with discretizations
$V_k^D(x_k, r_k)$.

Theorem III.8. Define
$V_k^D(x_k, r_k) := T^D_{\Delta,k}[V_{k+1}^D](x_k, r_k)$,
$k \in \{0, \dots, N-1\}$, as the value function with discretized risk
threshold/update, where $V_N^D(x_N, r_N) := V_N(x_N, r_N) = 0$.
Suppose the Assumptions stated earlier hold. Then,

$$\| V_k^D - V_k \|_\infty \le 2\Delta \left(
\frac{(M_r M_q c_{\max} - M_c(1 - M_r))(1 - M_r^N)}{(1 - M_r)^3}
+ \frac{N(N-1) M_r M_q c_{\max}}{2(1 - M_r)}
+ \frac{N(M_c(1 - M_r) - M_q M_r c_{\max})}{(1 - M_r)^2}
\right)$$

where $\Delta$ is the step size of the risk threshold discretization.

Proof. From Lemma III.7 we know that for
$j \in \{0, \dots, N-1\}$,
$\| T^D_{\Delta,j}[V_{j+1}] - T_j[V_{j+1}] \|_\infty \le (M_{V,j} + M_{V,j+1})\Delta$,
where $\Delta$ is the step size of the discretization
of the risk threshold $r_j$. Therefore, we have the following
expressions:

$$\begin{aligned}
\| V_j^D - V_j \|_\infty &= \| T^D_{\Delta,j}[V_{j+1}^D] - T_j[V_{j+1}] \|_\infty \\
&\le \| T^D_{\Delta,j}[V_{j+1}^D] - T^D_{\Delta,j}[V_{j+1}] \|_\infty + \| T^D_{\Delta,j}[V_{j+1}] - T_j[V_{j+1}] \|_\infty \\
&\le \| V_{j+1}^D - V_{j+1} \|_\infty + (M_{V,j} + M_{V,j+1})\Delta.
\end{aligned}$$

The equality is due to Theorem III.4 and the fact that
$V_j^D(x_j, r_j) = T^D_{\Delta,j}[V_{j+1}^D](x_j, r_j)$. The first
inequality is the triangle inequality. The last inequality is
based on the non-expansivity property in Lemma III.1 and
the arguments in Lemma III.7. Furthermore,

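$$\begin{aligned}
\| V_k^D - V_k \|_\infty &= \sum_{j=k}^{N-1} \left( \| V_j^D - V_j \|_\infty - \| V_{j+1}^D - V_{j+1} \|_\infty \right) \\
&\le \left( \sum_{j=k}^{N-1} (M_{V,j} + M_{V,j+1}) \right) \Delta
\le 2 \left( \sum_{j=0}^{N-1} M_{V,j} \right) \Delta.
\end{aligned}$$

Therefore, the proof is completed by summing the right side
of the inequality from $0$ to $N-1$ and combining all previous
arguments. □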

As the step size $\Delta \to 0$, for any $x_k \in S$ and
$r_k \in \Phi_k(x_k)$, this Theorem implies that
$V_k^D(x_k, r_k) \to V_k(x_k, r_k)$.
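The way the per-stage errors accumulate can be sketched in a few lines. The code below assumes only the one-step recursion used in the proof, namely that the error at stage j is at most the error at stage j+1 plus (M_{V,j} + M_{V,j+1}) times the step size, with zero error at the terminal stage. The function name `accumulated_bound` and the Lipschitz constants in the usage lines are hypothetical placeholders, not values from the paper.

```python
def accumulated_bound(m_v, delta):
    """Backward accumulation of the discretization error bound.

    m_v[j] plays the role of the Lipschitz constant M_{V,j} for
    j = 0..N (illustrative values only). Returns a list e where e[j]
    bounds the sup-norm error at stage j, with e[N] = 0.
    """
    n = len(m_v) - 1
    e = [0.0] * (n + 1)
    for j in range(n - 1, -1, -1):
        e[j] = e[j + 1] + (m_v[j] + m_v[j + 1]) * delta
    return e

# Hypothetical constants for a horizon of N = 3 stages.
m_v = [3.0, 2.0, 1.0, 0.0]
coarse = accumulated_bound(m_v, 0.5)
fine = accumulated_bound(m_v, 0.25)
```

Halving the step size halves every entry of the bound, which is the linear dependence on the step size stated in the Theorem.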

Remark III.9. Unfortunately, similar to all multi-grid
discretization approaches discussed in [9], [11], [8], the
multi-grid discretization algorithm in this paper also suffers from
the curse of dimensionality. Suppose the number of discretized
grid points used is $|R|$. At each time step, the size of the
state space is $|S||R|$. However, the size of the action space
is $|A||R|^{|S|}$. Methods such as branch and bound or rollout
algorithms can be applied to find the minimizers in each step
to alleviate this issue, if the upper/lower bounds of the value
functions are effectively calculated.
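To make the growth concrete, the following sketch evaluates the two sizes from this remark for a small problem. The function name `grid_sizes` is hypothetical, and treating the grid count as a single integer per stage is an assumption of the sketch.

```python
def grid_sizes(n_states, n_actions, n_grid):
    """Sizes of the augmented state space and augmented action space."""
    state_size = n_states * n_grid                # |S| |R|
    action_size = n_actions * n_grid ** n_states  # |A| |R|^{|S|}
    return state_size, action_size

# Three states and two actions, for several grid resolutions.
sizes = {m: grid_sizes(3, 2, m) for m in (5, 10, 20, 40, 60, 80, 100, 150)}
```

Even with only three states, the action count grows with the cube of the grid resolution, while the state count grows linearly.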


IV. Numerical Implementation 


Consider an example with 3 states ($x \in \{1, 2, 3\}$), 2
available actions ($u \in \{1, 2\}$), and time horizon $N = 3$.
The costs, constraint costs, and transition probabilities are
given as follows:

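$$\begin{bmatrix} c(1,1) & c(1,2) \\ c(2,1) & c(2,2) \\ c(3,1) & c(3,2) \end{bmatrix}
= \begin{bmatrix} 1 & 3 \\ 2 & 4 \\ 5 & 6 \end{bmatrix}, \qquad
\begin{bmatrix} d(1,1) & d(1,2) \\ d(2,1) & d(2,2) \\ d(3,1) & d(3,2) \end{bmatrix}
= \frac{1}{10} \begin{bmatrix} 5 & 4 \\ 6 & 3 \\ 5 & 1 \end{bmatrix},$$
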

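$$Q(x'|x,1) = \begin{bmatrix} 0.2 & 0.5 & 0.3 \\ 0.3 & 0.5 & 0.2 \\ 0.4 & 0.3 & 0.3 \end{bmatrix}, \qquad
Q(x'|x,2) = \begin{bmatrix} 0.2 & 0.3 & 0.5 \\ 0.3 & 0.3 & 0.4 \\ 0.3 & 0.4 & 0.3 \end{bmatrix},$$

where row $x$ lists the probabilities of the successor states $x' \in \{1, 2, 3\}$.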
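As a quick sanity check on the transition data, the sketch below stores the two kernels as nested lists and verifies that every row is a probability distribution. The row-per-current-state layout and the helper name `is_stochastic` are assumptions of this sketch.

```python
# Transition probabilities Q[u][x-1][x'-1] as read from the example.
# Assumption: each row is the distribution over successor states x'
# for one current state x.
Q = {
    1: [[0.2, 0.5, 0.3], [0.3, 0.5, 0.2], [0.4, 0.3, 0.3]],
    2: [[0.2, 0.3, 0.5], [0.3, 0.3, 0.4], [0.3, 0.4, 0.3]],
}

def is_stochastic(rows, tol=1e-9):
    """True if every row is nonnegative and sums to one within tol."""
    return all(min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol for row in rows)

checks = {u: is_stochastic(rows) for u, rows in Q.items()}
```

Both kernels pass the check row by row, which supports reading each row as the distribution conditioned on one current state.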


For any $x_0 \in S$ and $r_0 \in \Phi_0(x_0)$, the risk-sensitive
constrained stochastic optimal control problem we are solving
is as follows:

$$\min_{\pi \in \Pi} \; E\left[ \sum_{k=0}^{2} c(x_k, u_k) \right]$$

$$\text{subject to } \rho_{0,3}\big( d(x_0, u_0), d(x_1, u_1), d(x_2, u_2), 0 \big) \le r_0,$$

where $u_k = \pi_k(h_{0,k})$ for $k \in \{0, 1, 2\}$,

$$\rho_{0,3}(Z_0, Z_1, Z_2, Z_3) = Z_0 + \rho_0(Z_1 + \rho_1(Z_2 + \rho_2(Z_3)))$$

and

$$\rho_k(V) = E[V] + 0.2 \left( E\left[ [V - E[V]]_+^2 \right] \right)^{1/2}.$$
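The one-step risk measure above is a mean upper-semideviation of order two with weight 0.2. A minimal sketch for a finite distribution follows; the function name `mean_semideviation` and the argument `kappa` are hypothetical names introduced here, and the two-point distribution is an illustrative input only.

```python
import math

def mean_semideviation(values, probs, kappa=0.2):
    """E[V] + kappa * sqrt(E[ max(V - E[V], 0)^2 ]) for a finite distribution."""
    mean = sum(p * v for v, p in zip(values, probs))
    upper = sum(p * max(v - mean, 0.0) ** 2 for v, p in zip(values, probs))
    return mean + kappa * math.sqrt(upper)

# A fair two-point cost: only the outcome above the mean is penalized.
risk_two_point = mean_semideviation([0.0, 1.0], [0.5, 0.5])

# A deterministic cost has no upper deviation, so its risk is the cost itself.
risk_constant = mean_semideviation([2.0, 2.0], [0.5, 0.5])
```

For the two-point cost the mean is 0.5 and the penalized term is 0.2 times the square root of 0.125, so the risk exceeds the mean by roughly 0.07.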

First, this problem can be recast as a multi-stage constrained
dynamic program using the methods described by Theorem IV.3
in [1]. Furthermore, based on the discretized dynamic programming
equations of Section III, we can approximate the optimal value
function using risk threshold/update discretization. In this example,
we discretize every risk threshold set into $M$ regions, where

$$M \in \{5, 10, 20, 40, 60, 80, 100, 150\}.$$

With different sizes of risk threshold discretization, we get
approximations of the optimal value function with varying
degrees of accuracy. Figure 1 shows both the approximations
of the value function using various step sizes and their
approximation errors. As $M$ increases, the approximated value
function converges towards the true optimal value function.
However, as discussed in Remark III.9, the size of the action
space increases exponentially with the number of states, which
makes enumerating all state/action pairs during value iteration
computationally expensive.


V. Conclusion 

In this paper we have presented and analyzed a uniform-grid
discretization algorithm for approximating Bellman's
recursion for finite horizon constrained stochastic optimal
control problems. Although the current algorithm suffers
from the curse of dimensionality, it is so far the only known
algorithm for numerically approximating constrained dynamic
programming algorithms with continuous risk updates. This
paper also leaves important extensions open for further
research, including randomized grid sampling and variable
resolution of discretization.


[Figure 1: plot omitted. Legend: 5, 10, 20, 40, 60, 80, 100, and 150 points; optimal solution. Horizontal axis: risk threshold.]

Fig. 1. Convergence of approximated value functions, and errors of
approximations.